Abstract


ROUGE is a widely adopted, automatic evaluation measure for text
summarization. While it has been shown to correlate well with human
judgements, it is biased towards surface lexical similarities. This
makes it unsuitable for the evaluation of abstractive summarization, or
summaries with substantial paraphrasing. We study the effectiveness of
word embeddings to overcome this disadvantage of ROUGE. Specifically,
instead of measuring lexical overlaps, word embeddings are used to
compute the semantic similarity of the words used in summaries. Our
experimental results show that our proposal is able to achieve better
correlations with human judgements when measured with the Spearman and
Kendall rank coefficients.


1 Introduction 


Automatic text summarization is a rich field of research. For example,
shared task evaluation workshops for summarization were held for more
than a decade in the Document Understanding Conference (DUC), and
subsequently the Text Analysis Conference (TAC). An important element of
these shared tasks is the evaluation of participating systems.
Initially, manual evaluation was carried out, where human judges were
tasked to assess the quality of automatically generated summaries.
However, in an effort to make evaluation more scalable, the automatic
ROUGE¹ measure (Lin, 2004b) was introduced in DUC-2004. ROUGE determines
the quality of an automatic summary by comparing overlapping units such
as n-grams, word sequences, and word pairs with human-written summaries.


¹ Recall-Oriented Understudy for Gisting Evaluation


ROUGE is not perfect, however. Two problems with ROUGE are that 1) it
favors lexical similarities between generated summaries and model
summaries, which makes it unsuitable to evaluate abstractive
summarization, or summaries with a significant amount of paraphrasing,
and 2) it does not make any provision to cater for the readability or
fluency of the generated summaries.

There have been ongoing efforts to improve on automatic summarization
evaluation measures, such as the Automatically Evaluating Summaries of
Peers (AESOP) task in TAC (Dang and Owczarzak, 2009; Owczarzak, 2010;
Owczarzak and Dang, 2011). However, ROUGE remains one of the most
popular metrics of choice, as it has repeatedly been shown to correlate
very well with human judgements (Lin, 2004a; Over and Yen, 2004;
Owczarzak and Dang, 2011).


In this work, we describe our efforts to tackle the first problem of
ROUGE that we have identified above: its bias towards lexical
similarities. We propose to do this by making use of word embeddings
(Bengio et al., 2003). Word embeddings refer to the mapping of words
into a multi-dimensional vector space. We can construct the mapping such
that the distance between two word projections in the vector space
corresponds to the semantic similarity between the two words. By
incorporating these word embeddings into ROUGE, we can overcome its bias
towards lexical similarities and instead make comparisons based on the
semantics of word sequences. We believe that this will result in better
correlations with human assessments, and avoid situations where two word
sequences share similar meanings, but get unfairly penalized by ROUGE
due to differences in lexicographic representations.

As an example, consider these two phrases: 1) It is raining heavily, and
2) It is pouring. If we are performing a lexical string match, as ROUGE
does, there is nothing in common between the terms “raining”, “heavily”,
and “pouring”. However, these two phrases mean the same thing. If one of
the phrases was part of a human-written summary, while the other was
output by an automatic summarization system, we want to be able to
reward the automatic system accordingly.

In our experiments, we show that word embeddings indeed give us better
correlations with human judgements when measured with the Spearman and
Kendall rank coefficients. This is a significant and exciting result.
Beyond just improving the evaluation prowess of ROUGE, it has the
potential to expand the applicability of ROUGE to abstractive
summarization as well.


2 Related Work 


While ROUGE is widely used, as we have noted earlier, there is a
significant body of work studying the evaluation of automatic text
summarization systems. A good survey of many of these measures has been
written by Steinberger and Jezek (2012). We will thus not attempt to go
through every measure here, but rather highlight the more significant
efforts in this area.

Besides ROUGE, Basic Elements (BE) (Hovy et al., 2005) has also been
used in the DUC/TAC shared task evaluations. It is an automatic method
which evaluates the content completeness of a generated summary by
breaking up sentences into smaller, more granular units of information
(referred to as “Basic Elements”).

The pyramid method originally proposed by Passonneau et al. (2005) is
another staple in DUC/TAC. However, it is a semi-automated method, where
significant human intervention is required to identify units of
information, called Summary Content Units (SCUs), and then to map
content within generated summaries to these SCUs. Recently, however, an
automated variant of this method has been proposed (Passonneau et al.,
2013). In this variant, word embeddings are used, as we are proposing in
this paper, to map text content within generated summaries to SCUs.
However, the SCUs still need to be manually identified, limiting this
variant’s scalability and applicability.

Many systems have also been proposed in the AESOP task in TAC from 2009
to 2011. For example, the top system reported in Owczarzak and Dang
(2011), AutoSummENG (Giannakopoulos and Karkaletsis, 2009), is a
graph-based system which scores summaries based on the similarity
between the graph structures of the generated summaries and model
summaries.


3 Methodology 


Let us now describe our proposal to integrate word embeddings into ROUGE
in greater detail.

To start off, we will first describe the word embeddings that we intend
to adopt. A word embedding is really a function $W$, where
$W : w \rightarrow \mathbb{R}^n$, and $w$ is a word or word sequence.
For our purpose, we want $W$ to map two words $w_1$ and $w_2$ such that
their respective projections are closer to each other if the words are
semantically similar, and further apart if they are not. Mikolov et al.
(2013) describe one such variant, called word2vec, which gives us this
desired property². We will thus be making use of word2vec.


We will now explain how word embeddings can be incorporated into ROUGE.
There are several variants of ROUGE, of which ROUGE-1, ROUGE-2, and
ROUGE-SU4 have often been used. This is because they have been found to
correlate well with human judgements (Lin, 2004a; Over and Yen, 2004;
Owczarzak and Dang, 2011). ROUGE-1 measures the amount of unigram
overlap between model summaries and automatic summaries, and ROUGE-2
measures the amount of bigram overlap. ROUGE-SU4 measures the amount of
overlap of skip-bigrams, which are pairs of words in the same order as
they appear in a sentence, though not necessarily adjacent. In each of
these variants, overlap is computed by matching the lexical form of the
words within the target pieces of text. Formally, we can define this as
a similarity function $f_R$ such that:


$$
f_R(w_1, w_2) =
\begin{cases}
1, & \text{if } w_1 = w_2 \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

where $w_1$ and $w_2$ are the words (could be unigrams or n-grams) being
compared.
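
As a rough illustration of Equation (1), the sketch below computes a simplified ROUGE-N recall in Python using exact lexical matches only. The function names (`ngrams`, `f_r`, `rouge_n_recall`) are illustrative assumptions and do not come from the ROUGE toolkit, which additionally offers stemming, stopword removal, and multiple references. The two phrases are the ones used in the introduction.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f_r(w1, w2):
    """Lexical similarity of Equation (1): 1 on an exact match, else 0."""
    return 1 if w1 == w2 else 0


def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall against a single reference (assumption:
    no stemming, no stopword removal, one model summary)."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    unused = ngrams(candidate, n)  # candidate n-grams not yet matched
    matched = 0
    for ref_gram in ref:
        for idx, cand_gram in enumerate(unused):
            if f_r(ref_gram, cand_gram):
                matched += 1
                del unused[idx]  # each candidate n-gram is used at most once
                break
    return matched / len(ref)


reference = "it is raining heavily".split()
candidate = "it is pouring".split()
print(rouge_n_recall(candidate, reference, n=1))  # 2 of 4 unigrams match: 0.5
print(rouge_n_recall(candidate, reference, n=2))  # 1 of 3 bigrams match: 1/3
print(f_r("raining", "pouring"))                  # 0, despite similar meaning
```

Only the function words "it" and "is" contribute to the score; the content words receive no credit, which is the behaviour the similarity function is meant to capture.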

² The effectiveness of the learnt mapping is such that we can now
compute analogies such as king - man + woman = queen.

³ https://github.com/ng-j-p/rouge-we

In our proposal³, which we will refer to as ROUGE-WE, we define a new
similarity function $f_{WE}$ such that:

$$
f_{WE}(w_1, w_2) =
\begin{cases}
0, & \text{if } v_1 \text{ or } v_2 \text{ are OOV} \\
v_1 \cdot v_2, & \text{otherwise}
\end{cases}
\tag{2}
$$

where $w_1$ and $w_2$ are the words being compared, and
$v_x = W(w_x)$. OOV here means a situation where we encounter a word $w$
that our word embedding function $W$ returns no vector for. For the
purpose of this work, we make use of a set of 3 million pre-trained
vector mappings⁴ trained from part of Google’s news dataset (?) for $W$.
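
A minimal Python sketch of Equation (2) follows. It is not taken from the ROUGE-WE repository; the names `f_we` and `TOY_VECTORS` and the three-dimensional vectors are invented for illustration, standing in for the pre-trained word2vec mappings. The toy vectors happen to have unit length, so the dot product here coincides with cosine similarity, which is not guaranteed for real embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def f_we(w1, w2, embeddings):
    """Similarity of Equation (2): 0 if either word is out of vocabulary,
    otherwise the dot product of the two word vectors."""
    v1 = embeddings.get(w1)
    v2 = embeddings.get(w2)
    if v1 is None or v2 is None:
        return 0.0
    return dot(v1, v2)


# Hypothetical 3-dimensional vectors, NOT real word2vec values.
TOY_VECTORS = {
    "raining": (0.6, 0.8, 0.0),
    "pouring": (0.8, 0.6, 0.0),
    "summary": (0.0, 0.0, 1.0),
}

print(f_we("raining", "pouring", TOY_VECTORS))  # high, close to 0.96
print(f_we("raining", "summary", TOY_VECTORS))  # 0.0, orthogonal vectors
print(f_we("raining", "drizzle", TOY_VECTORS))  # 0.0, "drizzle" is OOV
```

Unlike the exact-match function, this similarity is graded, so near-synonyms can earn partial credit instead of none.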
Reducing OOV terms for n-grams. With our formulation for $f_{WE}$, we
are able to compute variants of ROUGE-WE that correspond to those of
ROUGE, including ROUGE-WE-1, ROUGE-WE-2, and ROUGE-WE-SU4. However,
despite the large number of vector mappings that we have, there will
still be a large number of OOV terms in the case of ROUGE-WE-2 and
ROUGE-WE-SU4, where the basic units of comparison are bigrams.
To solve this problem, we can compose individual word embeddings
together. We follow the simple multiplicative approach described by
Mitchell and Lapata (2008), where individual vectors of constituent
tokens are multiplied together to produce the vector for an n-gram,
i.e.,

$$
W(w) = W(w_1) \times \ldots \times W(w_n)
\tag{3}
$$

where $w$ is an n-gram composed of individual word tokens, i.e.,
$w = w_1 w_2 \ldots w_n$. Multiplication between two vectors
$W(w_i) = \{v_{i1}, \ldots, v_{ik}\}$ and
$W(w_j) = \{v_{j1}, \ldots, v_{jk}\}$ in this case is defined as:

$$
\{v_{i1} \times v_{j1}, \ldots, v_{ik} \times v_{jk}\}
\tag{4}
$$

⁴ https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

4 Experiments

4.1 Dataset and Metrics

For our experiments, we make use of the dataset used in AESOP
(Owczarzak and Dang, 2011), and the corresponding correlation measures.

For clarity, let us first describe the dataset used in the main TAC
summarization task. The main summarization dataset consists of 44
topics, each of which is associated with a set of 10 documents. There
are also four human-curated model summaries for each of these topics.
Each of the 51 participating systems generated a summary for each of
these topics. These automatically generated summaries, together with the
human-curated model summaries, then form the basis of the dataset for
AESOP.

To assess how effective an automatic evaluation system is, the system is
first tasked to assign a score for each of the summaries generated by
all of the 51 participating systems. Each of these summaries would also
have been assessed by human judges using these three key metrics:

Pyramid. As reviewed in Section 2, this is a semi-automated measure
described in Passonneau et al. (2005).

Responsiveness. Human judges are tasked to evaluate how well a summary
adheres to the information requested, as well as the linguistic quality
of the generated summary.

Readability. Human judges give their judgement on how fluent and
readable a summary is.

The evaluation system’s scores are then tested to see how well they
correlate with the human assessments. The correlation is evaluated with
a set of three metrics, including 1) Pearson correlation (P), 2)
Spearman rank coefficient (S), and 3) Kendall rank coefficient (K).

4.2 Results

We evaluate three different variants of our proposal, ROUGE-WE-1,
ROUGE-WE-2, and ROUGE-WE-SU4, against their corresponding variants of
ROUGE (i.e., ROUGE-1, ROUGE-2, ROUGE-SU4). It is worth noting here that
in AESOP in 2011, ROUGE-SU4 was shown to correlate very well with human
judgements, especially for pyramid and responsiveness, and outperformed
most of the participating systems.

Tables 1, 2, and 3 show the correlation of the scores produced by each
variant of ROUGE-WE with human assessed scores for pyramid,
responsiveness, and readability respectively. The tables also show the
correlations achieved by ROUGE-1, ROUGE-2, and ROUGE-SU4. The best
result for each column has been bolded for readability.

ROUGE-WE-1 is observed to correlate very well with the pyramid,
responsiveness, and readability scores when measured with the Spearman
and Kendall rank correlation. However, ROUGE-SU4 correlates better with
human assessments for the Pearson correlation. The key difference
between the Pearson correlation and Spearman/Kendall rank correlation is
that the former assumes that the variables being tested are normally
distributed, and that the






Measure         P        S        K
ROUGE-WE-1      0.9492   0.9138   0.7534
ROUGE-WE-2      0.9765   0.8984   0.7439
ROUGE-WE-SU4    0.9783   0.8808   0.7198
ROUGE-1         0.9661   0.9085   0.7466
ROUGE-2         0.9606   0.8943   0.7450
ROUGE-SU4       0.9806   0.8935   0.7371

Table 1: Correlation with pyramid scores, measured with Pearson r (P),
Spearman ρ (S), and Kendall τ (K) coefficients.


Measure         P        S        K
ROUGE-WE-1      0.9155   0.8192   0.6308
ROUGE-WE-2      0.9534   0.7974   0.6149
ROUGE-WE-SU4    0.9538   0.7872   0.5969
ROUGE-1         0.9349   0.8182   0.6334
ROUGE-2         0.9416   0.7897   0.6096
ROUGE-SU4       0.9545   0.7902   0.6017

Table 2: Correlation with responsiveness scores, measured with Pearson r
(P), Spearman ρ (S), and Kendall τ (K) coefficients.


Measure         P        S        K
ROUGE-WE-1      0.7846   0.4312   0.3216
ROUGE-WE-2      0.7819   0.4141   0.3042
ROUGE-WE-SU4    0.7931   0.4068   0.3020
ROUGE-1         0.7900   0.3914   0.2846
ROUGE-2         0.7524   0.3975   0.2925
ROUGE-SU4       0.7840   0.3953   0.2925

Table 3: Correlation with readability scores, measured with Pearson r
(P), Spearman ρ (S), and Kendall τ (K) coefficients.


variables are linearly related to each other. The latter two measures
are, however, non-parametric and make no assumptions about the
distribution of the variables being tested. We argue that the
assumptions made by the Pearson correlation may be too constraining,
given that any two independent evaluation systems may not exhibit
linearity.
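
The difference between the three coefficients can be illustrated with a small, dependency-free Python sketch. The scores below are invented and are not drawn from the TAC data; the Kendall function implements the simple tau-a form, which is an assumption, since the exact variant used in AESOP is not stated here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores: the metric ranks the systems exactly as the humans
# do, but the relationship between the two scales is cubic, not linear.
metric_scores = [1, 2, 3, 4, 5]
human_scores = [m ** 3 for m in metric_scores]

print(pearson(metric_scores, human_scores))   # below 1: penalizes non-linearity
print(spearman(metric_scores, human_scores))  # 1 (up to rounding): same ranking
print(kendall(metric_scores, human_scores))   # 1.0: every pair is concordant
```

In this constructed case both rank coefficients report perfect agreement, while the Pearson coefficient drops to roughly 0.94 purely because the two score scales are not linearly related.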

Looking at the two bigram-based variants, ROUGE-WE-2 and ROUGE-WE-SU4,
we observe that ROUGE-WE-2 improves on ROUGE-2 most of the time,
regardless of the correlation metric used. This lends further support to
our proposal to use word embeddings with ROUGE.

However, ROUGE-WE-SU4 is only better than ROUGE-SU4 when evaluating
readability. It does consistently worse than ROUGE-SU4 for pyramid and
responsiveness. This is likely due to how we have chosen to compose
unigram word vectors into bigram equivalents. The multiplicative
approach that we have taken worked better for ROUGE-WE-2, which looks at
contiguous bigrams. These are easier to interpret semantically than
skip-bigrams (the target of ROUGE-WE-SU4). The latter, by nature of
their construction, lose some of the semantic meaning attached to each
word, and thus may not be as amenable to the linear composition of word
vectors.
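
To make the contrast between contiguous bigrams and skip-bigrams concrete, the following Python sketch composes word vectors by element-wise multiplication, as in Equations (3) and (4), and enumerates skip-bigrams. The helper names, the toy vectors, and the reading of the "4" in SU4 as a maximum of four intervening words are assumptions made for illustration; the unigram counts that ROUGE-SU also includes are left out.

```python
def compose(vectors):
    """Element-wise product of word vectors, as in Equations (3) and (4)."""
    composed = list(vectors[0])
    for vec in vectors[1:]:
        composed = [a * b for a, b in zip(composed, vec)]
    return composed


def embed_ngram(ngram, embeddings):
    """Composed vector for an n-gram, or None if any word is OOV."""
    vectors = [embeddings.get(word) for word in ngram]
    if any(vec is None for vec in vectors):
        return None
    return compose(vectors)


def skip_bigrams(tokens, max_skip=4):
    """Ordered word pairs with at most max_skip words between them."""
    pairs = []
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs


# Hypothetical vectors, NOT real word2vec values.
toy_vectors = {
    "raining": (1.0, 2.0, 3.0),
    "heavily": (0.5, 0.0, 2.0),
}

print(embed_ngram(("raining", "heavily"), toy_vectors))  # [0.5, 0.0, 6.0]
print(embed_ngram(("raining", "drizzle"), toy_vectors))  # None, OOV word

tokens = "it is raining very heavily".split()
print(skip_bigrams(tokens, max_skip=0))  # contiguous bigrams only
print(skip_bigrams(tokens, max_skip=1))  # also pairs such as ("it", "raining")
```

With a larger `max_skip`, pairs such as ("it", "heavily") are produced even though the two words do not form a phrase, which gives some intuition for why a composed vector may be less meaningful for skip-bigrams than for contiguous bigrams.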


Owczarzak and Dang (2011) report only the results of the top systems in
AESOP in terms of Pearson’s correlation. To get a more complete picture
of the usefulness of our proposal, it will be instructive to also
compare it against the other top systems in AESOP, when measured with
the Spearman/Kendall correlations. We show in Table 4 the top three
systems which correlate best with the pyramid score when measured with
the Spearman rank coefficient. C_S_IIITH3 (Kumar et al., 2011) is a
graph-based system which assesses summaries based on differences in word
co-locations between generated summaries and model summaries. BE-HM
(baseline by the organizers of the AESOP task) is the BE system (Hovy et
al., 2005), where basic elements are identified using a head-modifier
criterion on parse results from Minipar. Lastly, catolicascl (de
Oliveira, 2011) is also a graph-based system which frames the summary
evaluation problem as a maximum bipartite graph matching problem.


Measure         S        K
ROUGE-WE-1      0.9138   0.7534
C_S_IIITH3      0.9033   0.7582
BE-HM           0.9030   0.7456
catolicascl     0.9017   0.7351

Table 4: Correlation with pyramid scores of top systems in AESOP 2011,
measured with the Spearman ρ (S), and Kendall τ (K) coefficients.

We see that ROUGE-WE-1 displays better correlations with pyramid scores
than the top system in AESOP 2011 (i.e., C_S_IIITH3) when measured with
the Spearman coefficient. The latter does slightly better, however, for
the Kendall coefficient. This observation further validates that our
proposal is an effective enhancement to ROUGE.


5 Conclusion 


We proposed an enhancement to the popular ROUGE metric in this work,
ROUGE-WE. ROUGE is biased towards identifying lexical similarity when
assessing the quality of a generated summary. We improve on this by
incorporating the use of word embeddings. This enhancement allows us to
go beyond surface lexicographic matches, and capture instead the
semantic similarities between words used in a generated summary and a
human-written model summary. Experimenting on the TAC AESOP dataset, we
show that this proposal exhibits very good correlations with human
assessments, measured with the Spearman and Kendall rank coefficients.
In particular, ROUGE-WE-1 outperforms leading state-of-the-art systems
consistently.


Looking ahead, we want to continue building on this work. One area to
improve on is the use of a more inclusive evaluation dataset. The AESOP
summaries that we have used in our experiments are drawn from systems
participating in the TAC summarization task, where there is a strong
exhibited bias towards extractive summarizers. It will be helpful to
enlarge this set of summaries to include output from summarizers which
carry out substantial paraphrasing (Li et al., 2013; Ng et al., 2014;
Liu et al., 2015).


Another immediate goal is to study the use of better compositional
embedding models. The generalization of unigram word embeddings into
bigrams (or phrases) is still an open problem (Yin and Schütze, 2014; Yu
et al., 2014). A better compositional embedding model than the one that
we adopted in this work should help us improve the results achieved by
bigram variants of ROUGE-WE, especially ROUGE-WE-SU4. This is important
because earlier works have demonstrated the value of using skip-bigrams
for summarization evaluation.


An effective and accurate automatic evaluation measure will be a big
boon to our quest for better text summarization systems. Word embeddings
add a promising dimension to summarization evaluation, and we hope to
expand on the work we have shared to further realize its potential.